airbnb_df[, c("host_response_rate",
              "bathrooms",
              "weekly_price",
              "monthly_price",
              "cleaning_fee",
              "security_deposit",
              "guests_included",
              "extra_people",
              "review_scores_rating")] <- NULL

It is necessary to convert the data type of some categorical variables of interest to factor.
airbnb_df$neighbourhood_cleansed <- as.factor(airbnb_df$neighbourhood_cleansed)
airbnb_df$neighbourhood <- as.factor(airbnb_df$neighbourhood)
airbnb_df$property_type <- as.factor(airbnb_df$property_type)
airbnb_df$room_type <- as.factor(airbnb_df$room_type)
airbnb_df$bed_type <- as.factor(airbnb_df$bed_type)
airbnb_df$cancellation_policy <- as.factor(airbnb_df$cancellation_policy)

Source: Detailed Listings data for Washington, D.C. from Inside Airbnb (http://insideairbnb.com/get-the-data.html)
| Variable | Type | Description |
|---|---|---|
| host_id | num | Host identification number |
| host_name | char | Name of host |
| neighbourhood_cleansed | factor | Property’s neighborhood group |
| neighbourhood | factor | Property’s neighborhood |
| zipcode | char | Property’s zipcode |
| latitude | num | Latitude coordinate of property |
| longitude | num | Longitude coordinate of property |
| property_type | factor | Type of property |
| room_type | factor | Type of room |
| accommodates | num | Number of people the property can accommodate |
| bedrooms | num | Number of available bedrooms |
| beds | num | Number of available beds |
| bed_type | factor | Type of bed |
| price | num | Listing price |
| minimum_nights | num | Minimum number of nights per stay |
| availability_365 | num | Property’s availability in the next 365 days |
| availability_30 | num | Property’s availability in the next 30 days |
| availability_60 | num | Property’s availability in the next 60 days |
| availability_90 | num | Property’s availability in the next 90 days |
| reviews_per_month | num | Number of reviews per month |
| cancellation_policy | factor | Cancellation policy |
## # A tibble: 1 x 1
## count
## <int>
## 1 2000
## [1] 1.040626
R recognizes 2,000 values in this dataset as missing (NA), which is about 1.04% of all cells.
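A count and percentage like the ones above can be obtained in a few lines of base R. The sketch below is only an illustration under stated assumptions: `toy_df` is a made-up data frame standing in for `airbnb_df`, and the percentage is taken over all cells (rows times columns).

```r
# Toy data frame standing in for airbnb_df (values are made up for illustration)
toy_df <- data.frame(
  price = c(100, NA, 250, 80),
  beds  = c(1, 2, NA, NA),
  room  = c("Entire home/apt", "Private room", NA, "Shared room")
)

# Total number of NA cells across the whole data frame
na_count <- sum(is.na(toy_df))

# Share of NA cells among all cells (rows x columns), in percent
na_pct <- 100 * na_count / (nrow(toy_df) * ncol(toy_df))

# Per-column breakdown, useful for spotting which variables drive the total
na_by_col <- colSums(is.na(toy_df))

print(na_count)
print(round(na_pct, 2))
print(na_by_col)
```

The per-column breakdown is often more informative than the overall total, since missing values tend to cluster in a handful of variables.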
Question 1: What are the most common Airbnb properties in D.C.? What is the variation in price for different types of property?
Question 2: How does location influence property rental price?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 79.0 115.0 193.1 187.2 10000.0
## [1] 304.5666
prop_type <- airbnb_df %>% group_by(property_type) %>% summarise(average_price=mean(price,na.rm = TRUE),
min_price=min(price, na.rm=TRUE),
max_price=max(price, na.rm = TRUE),
std=sd(price,na.rm = TRUE),
min_night=min(minimum_nights,na.rm=TRUE),
accom=median(accommodates, na.rm=TRUE),
range=max(price, na.rm=TRUE)-min(price,na.rm=TRUE))
head(arrange(prop_type, average_price, min_night),3)
## # A tibble: 3 x 8
## property_type average_price min_price max_price std min_night accom
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hostel 45.6 33 150 31.5 1 6
## 2 Bungalow 92 30 325 70.5 1 3
## 3 Guest suite 101. 30 1500 74.2 1 3
## # … with 1 more variable: range <dbl>
## # A tibble: 3 x 8
## property_type average_price min_price max_price std min_night accom
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Serviced apa… 266. 100 569 101. 1 5
## 2 Dome house 300 300 300 NA 3 10
## 3 Resort 500 500 500 NA 3 4
## # … with 1 more variable: range <dbl>
The visualization suggests that apartments are the most common property type on Airbnb. The frequency table of property types confirms this:
table_prop <- as.data.frame(table(airbnb_df$property_type)/length(airbnb_df$property_type)*100)
head(arrange(table_prop,desc(Freq)),10)
## Var1 Freq
## 1 Apartment 46.0008741
## 2 House 21.2084790
## 3 Townhouse 15.4173951
## 4 Condominium 8.2604895
## 5 Guest suite 5.6927448
## 6 Serviced apartment 0.6555944
## 7 Loft 0.6446678
## 8 Bed and breakfast 0.6228147
## 9 Guesthouse 0.5135490
## 10 Other 0.2513112
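As a possible alternative to dividing by `length()`, base R's `prop.table()` computes the shares directly. The sketch below uses a made-up vector `types` (not the Airbnb data) to show one difference worth keeping in mind: `table()` drops `NA` values by default, while `length()` counts them.

```r
# Toy property types (made up); NA marks a listing with no recorded type
types <- factor(c("Apartment", "House", "Apartment", "Townhouse", NA, "Apartment"))

# Percentages over non-missing listings, sorted from most to least common
pct <- sort(prop.table(table(types)) * 100, decreasing = TRUE)
print(round(pct, 1))

# Dividing by length() keeps NA listings in the denominator instead
pct_incl_na <- table(types) / length(types) * 100
print(round(pct_incl_na, 1))
```

When a column has no missing values, the two approaches give identical percentages.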
Let’s plot some boxplots to gain further insight into the price distribution of each neighborhood:
The typical Airbnb listing in D.C. costs $115 per night (the median); the mean is higher, at $193, because a few very expensive listings pull it up. The most expensive accommodation costs $10,000 for a minimum of four nights, and the cheapest option costs only $10 for a night. By average price, the three most affordable property types are hostels ($46/night), bungalows ($92/night), and guest suites ($101/night), with standard deviations of $32, $70, and $74 respectively. On the other hand, resorts are the most expensive property type ($500/night), with an undefined standard deviation because there is only one listing of this type.

From the visualization, the most common listings in D.C., with fairly low prices, are apartments, followed by houses, townhouses, and condominiums. Apartments account for 46% of all listings, while houses make up 21% and townhouses 15%. In the “Price Variation of Property Type” graph, the house category shows the greatest variation in price and contains the highest-priced outlier.
To compare Airbnb price ranges across D.C. neighborhoods, we first look at the standard deviation visualization. The Georgetown, Burleith-Hillandale area has the greatest dispersion in price. This is due to an outlier: the $10,000 Historic Georgetown Residence. In contrast, the Sheridan, Barry Farm, Buena Vista neighborhood has the lowest standard deviation in price. Furthermore, the Downtown, Chinatown, Penn Quarters area has the highest median price. Columbia Heights-Mt. Pleasant and Cathedral Heights appear to have similar price ranges, with Cathedral Heights having a slightly lower median price.
In case you are bored, here is an interactive map for detailed listing of Airbnb properties in D.C.
Question: Do townhouses have a higher average price compared to condominiums?
A one-sided, two-sample t-test will be performed at the 95% confidence level (\(\alpha = 0.05\)).
| Null hypothesis | Alternative hypothesis |
|---|---|
| The average price of townhouses is equal to that of condos. | The average price of townhouses is higher than that of condos. |
| \(H_{o}:\mu_{T} = \mu_{C}\) | \(H_{a}:\mu_{T} > \mu_{C}\) |
townhouse <- airbnb_df %>% filter(property_type == "Townhouse")
condo <- airbnb_df %>% filter(property_type == "Condominium")
t.test(townhouse$price, condo$price, alternative="greater", conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: townhouse$price and condo$price
## t = 0.94315, df = 1516.4, p-value = 0.1729
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -6.361168 Inf
## sample estimates:
## mean of x mean of y
## 180.9809 172.4431
P-value: 0.1729 > \(0.05 = \alpha\)
Conclusion: Fail to reject null hypothesis.
Real-world interpretation: The difference between the two sample means is not statistically significant. There is not enough evidence in our data to conclude that the average price of townhouses is higher than that of condominiums. The price difference we observed in the visualization could plausibly be due to chance.
The 95% confidence level means that, over repeated samples, intervals constructed this way would contain the true difference between the two group means 95% of the time. Here a one-sided confidence interval from -6.36 to \(\infty\) was used. This interval contains 0, which implies that 0 is a plausible value for the true difference. Hence, we fail to reject the null hypothesis.
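The link between the one-sided p-value and the one-sided confidence interval can be checked directly in R. The sketch below is an illustration on simulated data, not the Airbnb listings: `group_a` and `group_b` are hypothetical samples whose means and spreads were chosen arbitrarily.

```r
set.seed(1)  # make the simulated prices reproducible

# Simulated nightly prices; the means and spreads are arbitrary assumptions
group_a <- rnorm(200, mean = 180, sd = 150)
group_b <- rnorm(150, mean = 172, sd = 140)

# Welch two-sample t-test, one-sided: H_a is mean(group_a) > mean(group_b)
res <- t.test(group_a, group_b, alternative = "greater", conf.level = 0.95)

lower_bound <- res$conf.int[1]   # the one-sided interval is (lower_bound, Inf)
reject_null <- res$p.value < 0.05

# For a one-sided test, rejecting at alpha = 0.05 is the same event as
# the lower confidence bound lying above 0
print(c(p_value = res$p.value, lower_bound = lower_bound))
print(reject_null == (lower_bound > 0))
```

Whatever the simulated samples turn out to be, the decision from the p-value and the decision from the interval agree, which is why the two are read together above.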
Question: Is there a relationship between room type and bed type?
We’ll perform a Chi-square test with \(\alpha = 0.05\).
| Null hypothesis | Alternative hypothesis |
|---|---|
| \(H_{o}:\) Room type and bed type are independent. | \(H_{a}:\) Room type and bed type are dependent. |
table_room_bed <- table(airbnb_df$room_type, airbnb_df$bed_type)
result <- chisq.test(table_room_bed)
## Warning in chisq.test(table_room_bed): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: table_room_bed
## X-squared = 90.777, df = 12, p-value = 3.491e-14
P-value: 3.491E-14 < \(0.05 = \alpha\).
Conclusion: Reject null hypothesis.
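Real-world interpretation: There is enough evidence to conclude that room type and bed type are related. However, the result should be treated with caution: the warning indicates that some expected cell counts are small, so the chi-squared approximation may be unreliable.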
Question: Assuming I’m a hostel owner, I would like to predict the price depending on the number of beds I have. How does the price per night relate to the number of beds?
Null hypothesis: \(H_{o}:\) There is no correlation between the number of beds and the price.
Alternative hypothesis: \(H_{a}:\) There is a correlation between the number of beds and the price.
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
##
## Call:
## lm(formula = airbnb_df$price ~ airbnb_df$beds)
##
## Coefficients:
## (Intercept) airbnb_df$beds
## 95.96 50.52
##
## Call:
## lm(formula = airbnb_df$price ~ airbnb_df$beds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2572.1 -96.5 -59.5 -6.0 9651.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.964 4.942 19.42 <2e-16 ***
## airbnb_df$beds 50.523 2.019 25.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 293.8 on 9137 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.06411, Adjusted R-squared: 0.06401
## F-statistic: 625.9 on 1 and 9137 DF, p-value: < 2.2e-16
The linear model is: \[\text{price} = 50.52 \times \text{number of beds} + 95.96\]
P-value: < 2.2E-16, which is below \(0.05 = \alpha\)
Our model is statistically significant, and there is a relationship between the number of beds and the price.
## [1] 0.2532029
With \(r^2 = 0.06411\) (adjusted \(r^2 = 0.06401\)), about 6.4% of the variation in price is explained by the number of beds.
\(r = 0.2532\) indicates a weak positive correlation between the two variables.
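The fitted line and the link between \(r\) and \(r^2\) can be reproduced with a few lines of R. This is a minimal sketch: the coefficients and correlation are copied from the output above, `predict_price` is a hypothetical helper introduced only for illustration, and the final check uses the built-in `mtcars` dataset rather than the Airbnb data.

```r
# Coefficients copied from the regression summary above
intercept <- 95.964
slope     <- 50.523

# Hypothetical helper: predicted nightly price for a given number of beds
predict_price <- function(beds) intercept + slope * beds

print(predict_price(c(1, 3, 6)))

# In simple linear regression, R-squared equals the squared correlation
r <- 0.2532029
print(round(r^2, 5))

# The same identity checked on a built-in dataset (mtcars ships with R)
fit <- lm(mpg ~ wt, data = mtcars)
print(all.equal(summary(fit)$r.squared, cor(mtcars$mpg, mtcars$wt)^2))
```

Because the number of beds explains only about 6.4% of the price variation, predictions from this line carry a lot of uncertainty; the residual standard error of 293.8 in the summary gives a sense of the typical miss.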
There are some limitations I encountered while analyzing the dataset. The Airbnb data from Inside Airbnb was last retrieved on November 22, 2019, so the analysis reflects only what had been scraped from the Airbnb website as of that date. In addition, historical property prices are not available in this dataset.
While performing the Chi-squared test on the two categorical variables room_type and bed_type, R issued the warning “Chi-squared approximation may be incorrect”. This warning arises when some expected cell counts are small, in which case the approximation may be poor. The null hypothesis was still rejected because the p-value is far smaller than alpha, but the conclusion should be treated with caution.
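One way to follow up on such a warning is to inspect the expected counts and to compute a Monte Carlo p-value, which does not depend on the chi-squared approximation. The sketch below is an illustration under stated assumptions: `tab` is a made-up contingency table with sparse cells, not the room type by bed type table from this analysis, and the threshold of 5 is a common rule of thumb rather than a strict requirement.

```r
# Made-up contingency table with several sparse cells (not the Airbnb data)
tab <- matrix(c(40, 3, 1,
                25, 2, 0,
                 5, 1, 2),
              nrow = 3, byrow = TRUE,
              dimnames = list(room = c("A", "B", "C"),
                              bed  = c("X", "Y", "Z")))

# Expected counts under independence; suppressWarnings hides the same
# "approximation may be incorrect" warning discussed above
expected <- suppressWarnings(chisq.test(tab))$expected
n_small  <- sum(expected < 5)  # a common rule of thumb flags cells below 5

# Monte Carlo p-value: does not rely on the chi-squared approximation
set.seed(42)
sim <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)

print(round(expected, 2))
print(n_small)
print(sim$p.value)
```

If the simulated p-value leads to the same decision as the asymptotic one, the warning is less of a concern for the conclusion; `fisher.test()` is another option for sparse tables.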